Named Entity Recognition
Named Entity Recognition (NER) is a basic information extraction task in which words (or phrases) are classified into pre-defined entity groups (or marked as not of interest). The words or phrases in an entity group share common characteristics, and are identifiable by the shape of the word or by the context in which they appear in a sentence. Examples of entity groups are: names, numbers, locations, currency, dates, company names, etc.
John is planning a visit to London on October
|                           |         |
Name                        City      Date
In this example, name, city and date entities are identified.
This example uses publicly available NER datasets that are commonly used in research papers. The data must be divided into train and test sets, preprocessed, tokenized, and tagged with a finite set of entities in BIO format.
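As an illustration of the BIO scheme, here is a minimal sketch that turns entity spans into BIO tags: the first token of an entity gets a B- prefix, the following tokens of the same entity get I-, and everything else is tagged O. The helper name `spans_to_bio` and the labels PER and LOC are assumptions made for this example and are not part of the library.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)           # tokens outside any entity
    for start, end, label in spans:
        tags[start] = "B-" + label       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # remaining tokens of the same entity
    return tags


tokens = "John Smith is planning a visit to London".split()
spans = [(0, 2, "PER"), (7, 8, "LOC")]   # hypothetical annotations
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```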
The dataset files must be processed into a tabular format in which each entry has the following form:
<token> <tag_1> ... <tag_n>
In the above format, sentences are separated by an empty line. Each line holds a single token of the sentence followed by its tags, separated by whitespace (spaces or tabs).
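The file format can be read with a few lines of Python. The sketch below is not the library's loader; the function name `read_tagged_sentences` and the sample tokens and tags are assumptions used only to show how the format is structured.

```python
def read_tagged_sentences(lines):
    """Parse '<token> <tag_1> ... <tag_n>' lines into a list of sentences.

    Each sentence is a list of (token, [tags]) pairs; blank lines end a sentence.
    """
    sentences, current = [], []
    for line in lines:
        fields = line.split()            # splits on spaces or tabs
        if not fields:                   # empty line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((fields[0], fields[1:]))
    if current:                          # last sentence without trailing blank line
        sentences.append(current)
    return sentences


# Hypothetical sample with two tag columns (part of speech, entity label).
sample = """John NNP B-PER
visits VBZ O
London NNP B-LOC

He PRP O
left VBD O
""".splitlines()

parsed = read_tagged_sentences(sample)
```

The same function accepts an open file object, since iterating over a file yields its lines.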
Data can be loaded into the model using the SequentialTaggingDataset data loader, which works with the prepared train and test data sets described above.
The data loader returns 2 NumPy matrices:
1. a sparse word representation of the sentence words
2. a sparse character representation of the sentence words
The user can train models on either representation or on both.
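To picture the two representations, the following sketch builds a word-index array and a character-index array by hand. It is not the SequentialTaggingDataset implementation: it assumes that "sparse" means integer vocabulary indices, that index 0 is reserved for padding, and that sentences and words are padded to fixed maximum lengths. The sentences and sizes are made up.

```python
import numpy as np

sentences = [["John", "visits", "London"], ["He", "left"]]
max_sent_len, max_word_len = 4, 6        # illustrative limits

# Index 0 is reserved for padding in both vocabularies (an assumption).
words = sorted({w for s in sentences for w in s})
chars = sorted({c for w in words for c in w})
word_vocab = {w: i + 1 for i, w in enumerate(words)}
char_vocab = {c: i + 1 for i, c in enumerate(chars)}

# One row per sentence, one column per token.
word_ids = np.zeros((len(sentences), max_sent_len), dtype=np.int32)
# One row per sentence, one slot per token, one column per character.
char_ids = np.zeros((len(sentences), max_sent_len, max_word_len), dtype=np.int32)

for si, sent in enumerate(sentences):
    for wi, word in enumerate(sent[:max_sent_len]):
        word_ids[si, wi] = word_vocab[word]
        for ci, ch in enumerate(word[:max_word_len]):
            char_ids[si, wi, ci] = char_vocab[ch]

print(word_ids.shape, char_ids.shape)
```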
The NER model is based on the Bidirectional LSTM with Conditional Random Field sequence classifier published in a paper by Lample et al.
The model has 2 inputs:
- sentence words - converted into dense word embeddings or loaded from an external pre-trained word embedding model.
- character embedding - trained using the words of the sentences.
A high-level overview of the model is provided in the figure below:
Named entities can sometimes be easily identified by the shape of the words, by pre-built lexicons, by part-of-speech analysis, or by rules combining patterns of these features. In many other cases, those features are unknown or non-existent, and the context in which the words appear provides the indication of whether a word or a phrase is an entity.
With the help of RNN topologies we can use LSTMs to extract the character-based features of words. In this model we use convolutions to extract n-gram features from the characters making up words. A similar approach with RNNs takes the last state of a BiLSTM layer as a representation of the character embeddings. More information on character embeddings can be found in the paper.
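The idea of extracting character n-gram features with convolutions can be sketched in NumPy: each filter scores every window of consecutive character embeddings, and max pooling keeps the strongest response per filter. This is only an illustration with random values, not the model's actual layer; the function name `char_cnn_features` and all sizes are assumptions.

```python
import numpy as np


def char_cnn_features(char_embeddings, filters):
    """Max-pooled 1D convolution over the characters of one word.

    char_embeddings: (word_len, emb_dim), requires word_len >= filter width
    filters:         (n_filters, width, emb_dim)
    returns:         (n_filters,) feature vector for the word
    """
    n_filters, width, _ = filters.shape
    n_windows = char_embeddings.shape[0] - width + 1
    # Every window of `width` consecutive characters is one character n-gram.
    windows = np.stack([char_embeddings[i:i + width] for i in range(n_windows)])
    # Score each window with each filter: result has shape (n_windows, n_filters).
    responses = np.tensordot(windows, filters, axes=([1, 2], [1, 2]))
    # Max pooling over positions keeps the best-matching n-gram per filter.
    return responses.max(axis=0)


rng = np.random.default_rng(0)
char_emb = rng.normal(size=(6, 8))       # e.g. a 6-character word, 8-dim embeddings
filters = rng.normal(size=(4, 3, 8))     # 4 trigram filters
features = char_cnn_features(char_emb, filters)
print(features.shape)
```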
The main tagger model consists of bidirectional LSTM layers. The input to the LSTM layers is a concatenation of the word embedding vector and the character embedding vector (provided by the character embedding network).
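The concatenation step can be shown with array shapes alone. In this sketch the sentence length and both embedding sizes are arbitrary assumptions, and random values stand in for the real embeddings.

```python
import numpy as np

sentence_len, word_dim, char_dim = 5, 100, 25   # illustrative sizes

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(sentence_len, word_dim))    # one row per token
char_features = rng.normal(size=(sentence_len, char_dim))   # one row per token

# Concatenate along the feature axis so each token gets one combined vector.
tagger_input = np.concatenate([word_vectors, char_features], axis=-1)
print(tagger_input.shape)
```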
Finally, the outputs of the LSTM layers are passed through a fully-connected layer (for each token) and fed into a Conditional Random Field (CRF) classifier. Using a CRF has been empirically shown to provide more accurate models than single-token prediction layers (such as a softmax layer).
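One way to see why sequence-level decoding can help is to compare per-token prediction with Viterbi decoding, the standard algorithm for finding the best tag sequence under per-token scores plus tag-to-tag transition scores. The sketch below is not the library's CRF layer: the function name `viterbi_decode` and all scores are made up for illustration, and a trained CRF would learn its transition scores from data.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]:   score of tag j at token t
    transitions[i][j]: score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])          # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            step_back.append(best_i)
            step_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: scores[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


tags = ["O", "B-LOC", "I-LOC"]
emissions = [[2.0, 0.1, 0.0],            # hypothetical scores for 3 tokens
             [0.5, 0.4, 0.6],
             [0.0, 0.2, 1.5]]
transitions = [[0.0, 0.0, -10.0],        # O -> I-LOC is strongly penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(len(tags)), key=lambda j: row[j]) for row in emissions]
decoded = viterbi_decode(emissions, transitions)
print([tags[i] for i in greedy])
print([tags[i] for i in decoded])
```

With these scores, per-token prediction picks I-LOC directly after O, which is not a valid BIO sequence, while the decoded path starts the entity with B-LOC.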
Train a model with default parameters given input data files:
python examples/ner/train.py --train_file train.txt --test_file test.txt
Full training parameters
All customizable parameters can be obtained by running:
python examples/ner/train.py -h
|-h, --help||show this help message and exit|
|-b B||Batch size|
|-e E||Number of epochs|
|--train_file||Train file (sequential tagging dataset format)|
|--test_file||Test file (sequential tagging dataset format)|
|Entity labels tab number in train/test files|
|Max sentence length|
|Max word length in characters|
|Word features embedding dimension size|
|Character features embedding dimension size|
|Character feature extractor LSTM dimension size|
|Entity tagger LSTM dimension size|
|Path to external word embedding model file|
|Path for saving model weights|
|Path for saving model topology|
|--use_cudnn||Use cuDNN-based LSTM cells|
The training script automatically saves the model weights and topology information once training is complete (file names can be set with the options above).
The interactive.py script runs a pre-trained model in interactive mode, reading input directly from stdin.
Run python examples/ner/interactive.py -h for a full list of options:
|--model_path||Path of model weights|
|--model_info_path||Path of model topology|
python examples/ner/interactive.py --model_path model.h5 --model_info_path model_info.dat